1 Introduction
The value-based methods presented previously all presuppose a known world model (i.e., a known transition probability distribution). In most RL applications, however, and especially in the real world, the world model is not known, so we instead estimate value functions from data sampled as the agent interacts with the environment.
There are two well-tested methods used in this regard: Monte Carlo control and Temporal Difference learning.
2 Monte Carlo control
Monte Carlo control consists of two steps:
- Estimation: we roll out a policy from an initial state (or from an initial state–action pair if we’re estimating the Q-function), collect rewards, and, once the episode ends or the discount factor makes later rewards negligible, update our estimate of the value function toward the observed discounted return:
\[ V(s_t) \leftarrow V(s_t) + \eta \bigl[ G_t - V(s_t) \bigr], \]
where \( G_t \) is the discounted return collected from \( s_t \) onward; with \( \eta = 1/N(s_t) \), the count of visits to \( s_t \), this is exactly the running average of the returns.
- Improvement: once we have an estimate, we follow the policy-iteration scheme and improve the policy by acting greedily with respect to the current estimate:
\[ \pi_{k+1}(s) = \arg\max_a Q_k(s,a), \]
with \( Q_k(s,a) \) approximated by Monte Carlo estimation. A minimal sketch of this loop follows the list.
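To make the two steps concrete, here is a minimal sketch of every-visit Monte Carlo control with an ε-greedy policy. The tabular environment interface (`env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done)`) and the function names are assumptions for illustration, not a specific library; `eta` plays the role of \( \eta \) above.

```python
import random
from collections import defaultdict

def mc_control(env, n_actions, episodes=5000, gamma=0.99, eta=0.05, epsilon=0.1):
    """Every-visit Monte Carlo control with an epsilon-greedy policy (sketch)."""
    Q = defaultdict(lambda: [0.0] * n_actions)

    def policy(state):
        # Improvement step: act greedily w.r.t. the current Q, with
        # epsilon-exploration so every state-action pair keeps being visited.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(episodes):
        # Estimation step: roll out one full episode and record the trajectory.
        trajectory = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Walk the episode backwards so every discounted return G_t is
        # computed in a single pass, then nudge Q toward it.
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            Q[state][action] += eta * (G - Q[state][action])

    return Q
```

Since the policy is defined greedily from \( Q \), every estimation pass also improves the policy, which is exactly the policy-iteration structure described above.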
3 Temporal Difference (TD) Learning
Temporal Difference learning reduces the Bellman error for sampled states (or state–action pairs) at each transition:
\[ V(s_t) \leftarrow V(s_t) + \eta \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right], \]
where \( V(s_t) \) is moved toward the target value \( r_t + \gamma V(s_{t+1}) \).
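For comparison with the Monte Carlo update, here is a minimal TD(0) prediction sketch for a fixed policy, under the same hypothetical environment interface as above; the key difference is that the target \( r_t + \gamma V(s_{t+1}) \) is available at every transition, so no complete episode is needed before updating.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=5000, gamma=0.99, eta=0.05):
    """TD(0) value estimation for a fixed policy (sketch)."""
    V = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: the observed reward plus the discounted
            # estimate of the next state's value (zero beyond termination).
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += eta * (target - V[state])
            state = next_state
    return V
```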
4 Combining Monte Carlo and TD
Monte Carlo and TD are often combined. TD updates at each transition, whereas MC waits until an episode finishes (or the horizon is effectively reached). We can keep the bootstrapped TD update at each transition while looking further along the trajectory: instead of committing to a single horizon \( n \), we take a weighted average of the \( n \)-step returns \( G_{t:t+n} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) \), with weights parameterized by \( \lambda \in [0,1] \):
\[ G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}. \]
This is the lambda-return. It replaces the one-step target in the TD update, i.e. \( V(s_t) \leftarrow V(s_t) + \eta \bigl[ G_t^{\lambda} - V(s_t) \bigr] \). Notable cases:
- \( \lambda = 1 \): the lambda-return reduces to the full Monte Carlo return \( G_t \).
- \( \lambda = 0 \): the lambda-return reduces to the one-step TD(0) target \( r_t + \gamma V(s_{t+1}) \).
Algorithms that update toward \( G_t^{\lambda} \) in this way are referred to as TD(\(\lambda\)); a short sketch of computing the lambda-return for a finished episode follows.
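As a small illustration, the sketch below computes \( G_t^{\lambda} \) for every step of one finished episode using the backward recursion \( G_t^{\lambda} = r_t + \gamma\bigl[(1-\lambda)V(s_{t+1}) + \lambda G_{t+1}^{\lambda}\bigr] \), which is equivalent to the weighted sum above for episodes that terminate; the list-based inputs and the function name are assumptions for illustration.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.9):
    """Compute G_t^lambda for every step of one terminated episode.

    rewards[t] is r_t, and values[t] is V(s_{t+1}), the estimated value of
    the state reached after step t (0.0 for the terminal state).
    """
    G = 0.0  # lambda-return of the empty suffix after the final step
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # Backward recursion: mix the bootstrapped value and the
        # lambda-return of the rest of the trajectory.
        G = rewards[t] + gamma * ((1.0 - lam) * values[t] + lam * G)
        out[t] = G
    return out
```

Setting `lam=1` recovers the Monte Carlo return and `lam=0` the one-step TD target, matching the two cases listed above.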
TD policy learning can be done on-policy (SARSA) or off-policy (Q-learning). We’ll expand on both and provide simple implementations in the next article.